Short-form adaptive measure of financial toxicity from the Economic Strain and Resilience in Cancer (ENRICh) study: Derivation using modern psychometric techniques

Objectives This study sought to evaluate advanced psychometric properties of the 15-item Economic Strain and Resilience in Cancer (ENRICh) measure of financial toxicity for cancer patients. Methods We surveyed 515 cancer patients in the greater Houston metropolitan area using ENRICh from March 2019 to March 2020. We conducted a series of factor analyses alongside parametric and non-parametric item response theory (IRT) assessments using Mokken analysis and the graded response model (GRM). We used parameters derived from the GRM to run a simulated computerized adaptive test (CAT) assessment. Results Among participants, mean age was 58.49 years and 278 (54%) were female. The initial factor analysis suggested a one-factor scale structure. Negligible levels of differential item functioning (DIF) were evident in eight items. Three items were removed due to local dependence (Q3>+0.4). The original 11-point numerical rating scale did not function well, and a new 3-point scoring system was implemented. The final 12-item ENRICh had acceptable fit to the GRM (p<0.001; TLI = 0.94; CFI = 0.95; RMSEA = 0.09; RMSR = 0.06) as well as good scalability and dimensionality. We observed a high correlation between CAT version scores and scores on the 12-item measure (r = 0.98). During CAT, items 2 (money you owe) and 4 (stress level about finances) were most frequently administered, followed by items 1 (money in savings) and 5 (ability to pay bills). Scores from these four items alone were strongly correlated with those from the 12-item ENRICh (r = 0.96). Conclusion These CAT and 4-item versions provide options for quick screening in clinical practice and low-burden assessment in research.


Introduction
The use of advanced treatments and medical care facilities to diagnose and treat cancer has improved outcomes and prolonged the lives of patients [1,2]. However, the cost of these advanced treatments is increasing, and patients themselves are paying a greater proportion of the costs of treatment [3,4]. Approximately 48%-73% of cancer survivors experience adverse financial effects of cancer treatment, whether directly from costs of treatment or indirectly from lost income or ability to work [5]. In the United States, the greatest financial burden of cancer treatment is experienced by adults aged between 18 and 64 years [6].
In this study, we use the term financial toxicity to describe the negative effects on cancer patients' subjective and material experience resulting from the cost of cancer care [7,8]. The potential consequences of such adverse impact may manifest as material losses, psychological distress, and maladaptive coping strategies [8]. Therefore, financial toxicity should be assessed within material, psychological, and behavioral domains [6,9].
Though multi-level strategies to prevent and mitigate financial toxicity have been proposed [10], the effectiveness of such strategies depends on accurately identifying individuals at high risk of financial toxicity and measuring its severity. Currently, there are three validated patient-reported outcome measures (PROMs) of financial toxicity. The 12-item COmprehensive Score for financial Toxicity-Functional Assessment of Chronic Illness Therapy (COST-FACIT) tool measures general financial toxicity [11,12]. The 8-item InCharge Financial Distress/Financial Well-Being scale focuses on psychological distress-based financial hardship [13]. Our group developed the 15-item Economic Strain and Resilience in Cancer (ENRICh) measure to comprehensively encompass material, psychological, and behavioral coping dimensions (see S1 Table) [14,15]. Each item is scored from 0 to 10, indicating no burden and the highest burden, respectively. Previous studies of patients with stage I-IV cancer showed that the overall mean financial hardship score measured by ENRICh was 3.56 (sd = 2.64) [15], and the mean score for socioeconomically disadvantaged patients was 2.3 times higher [14]. However, to date, the advanced psychometric properties of this measure and its potential suitability for brief computational measurement using computerized adaptive testing (CAT) have not been assessed.
Computerized adaptive testing is a process in which the computer automatically administers the item from the item bank that is most relevant to the questionnaire-taker, based on his or her responses to the preceding items [16]. Previous research has demonstrated that the CAT approach can shorten fixed-length scales by as much as 82% by reducing the number of items administered [17-20], which makes more efficient and personalized PROM assessment possible [17]. Computerized adaptive testing is made possible by item response theory (IRT), a probabilistic framework that can be used to assess the advanced measurement properties of questionnaires [21,22], administer a personalized measure to each questionnaire-taker, and facilitate the development of short and effective versions of PRO measures by identifying the items that carry the most information. The methodology is widely used in educational assessment and is increasingly used in health assessment [17]. Previous studies on the delivery of IRT-based CAT tools have shown promising prospects for applying CAT algorithms through the construction of goal-oriented implementation platforms [23].
In this study, we aimed to apply advanced psychometric methods to data collected using the ENRICh to assess the suitability of the scale for CAT-based assessment. In doing so, we evaluated the measure's advanced psychometric properties and assessed the potential to create a shorter version of the measure to screen cancer patients at high risk of financial toxicity. A shorter version of the measure would reduce the respondent burden for patients with cancer and could be used with confidence in clinical practice.

Participants
As part of the Economic Strain and Resilience in Cancer (ENRICh) study, a total of 515 English-speaking participants aged 18 and older who were receiving ambulatory oncology care in the greater Houston metropolitan area were surveyed at participating medical, surgical, or radiation oncology clinics between March 2019 and March 2020. This study cohort was a subgroup of a parent study of 628 patients. The overall response rate was 69.1%. The survey was conducted under a protocol approved by the institutional review board of the MD Anderson Cancer Center (IRB 2016-0391). Patients provided informed consent by reading a consent statement presented before the survey; written consent was waived. No minors were included in the study.

Financial toxicity assessment
The 15-item ENRICh measure is a newly designed PROM that captures respondents' overall financial toxicity resulting from cancer and its treatment, comprising the dimensions of direct material burden, psychological burden, and depletion of coping resources [14,24].
Each item is scored using an 11-point numerical rating scale, with higher scores indicating increased financial burden at any point in the cancer trajectory. The median time from cancer diagnosis to survey was 267 days (IQR, 122.0, 535.5). The measure has acceptable reliability and validity for assessing cancer-related financial burden [14]. Consistent with the iterative nature of validation, we sought to assess the scale's advanced psychometric properties and suitability for CAT. For each item, a multiple imputation approach was employed to handle missing data, using predictive mean matching for numerical variables to reduce bias [25]. We used the imputed dataset of 515 patients to conduct the following advanced psychometric analyses: IRT and CAT. Imputation was necessary for Mokken analysis. Of note, with a sample of this size, the chi-squared (χ^2) test is prone to type I error and is likely to return a significant p-value [26].
IRT analysis and CAT simulation. We first assessed whether the scale data were eligible for IRT analysis, that is, whether they met the assumptions of unidimensionality, scalability, and local independence of items. These assumptions determine whether item parameters can be calibrated successfully to build the item bank for the subsequent CAT simulation. Where needed during this assessment, we made modifications to ensure that the assumptions were met. We then conducted three CAT simulations with stopping rules set at standard errors (SEs) of 0.32, 0.45, and 0.55 and compared their performance. The principles and mechanisms of IRT analysis are described in detail elsewhere [17]. The detailed analysis processes for IRT and CAT in this study are summarized in S1 Text.
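To make the simulation logic concrete, the following is a minimal Python sketch of a CAT run under the GRM with an SE-based stopping rule. It is not the Firestar-derived code used in this study: the item parameters `a` and `b`, the standard normal prior, maximum-information item selection, and expected a posteriori (EAP) scoring on a fixed grid are all assumptions made for illustration.

```python
import numpy as np

def _cumulative(theta, a, b):
    """P(X >= k) for each threshold, padded with 1 (left) and 0 (right)."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    core = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    ones = np.ones((theta.size, 1))
    return np.hstack([ones, core, 0.0 * ones])

def grm_probs(theta, a, b):
    """Category probabilities of one GRM item; rows = theta, cols = category."""
    cum = _cumulative(theta, a, b)
    return cum[:, :-1] - cum[:, 1:]

def item_information(theta, a, b):
    """Fisher information of one GRM item: sum over categories of P'^2 / P."""
    cum = _cumulative(theta, a, b)
    slope = a * cum * (1.0 - cum)            # derivative of each cumulative curve
    p = cum[:, :-1] - cum[:, 1:]
    dp = slope[:, :-1] - slope[:, 1:]
    return np.sum(dp ** 2 / np.clip(p, 1e-12, None), axis=1)

def simulate_cat(responses, a, b, se_stop=0.32):
    """Adaptively administer items to one respondent until SE <= se_stop."""
    grid = np.linspace(-4.0, 4.0, 161)
    posterior = np.exp(-0.5 * grid ** 2)     # assumed standard normal prior
    posterior /= posterior.sum()
    remaining = list(range(len(a)))
    administered = []
    theta, se = 0.0, np.inf
    while remaining and se > se_stop:
        # Select the unused item that is most informative at the current estimate
        infos = [item_information(theta, a[j], b[j])[0] for j in remaining]
        j = remaining.pop(int(np.argmax(infos)))
        administered.append(j)
        posterior = posterior * grm_probs(grid, a[j], b[j])[:, responses[j]]
        posterior /= posterior.sum()
        theta = float(np.sum(grid * posterior))                       # EAP estimate
        se = float(np.sqrt(np.sum((grid - theta) ** 2 * posterior)))  # posterior SD
    return theta, se, administered

# Hypothetical 12-item bank with 3 response categories (2 thresholds) per item
a = np.array([2.0, 2.2, 1.8, 2.4, 1.9, 1.6, 1.5, 1.7, 1.4, 1.6, 1.5, 1.4])
b = np.array([[-0.5 + 0.1 * j, 0.8 + 0.1 * j] for j in range(12)])
rng = np.random.default_rng(0)
true_theta = 1.0
responses = [int(rng.choice(3, p=grm_probs(true_theta, a[j], b[j])[0]))
             for j in range(12)]
theta_hat, se_hat, used = simulate_cat(responses, a, b, se_stop=0.32)
```

Because item selection depends only on the running estimate, a looser stopping rule (for example, SE = 0.55) administers a prefix of the items administered under a stricter one, which is why looser rules shorten the test.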

Software
We conducted all IRT analyses with the R packages "lavaan", "mokken", "mirt", and "lordif". We simulated CAT using code derived from the Firestar package [27] and assessed agreement with the fixed-length ENRICh tool using the "BlandAltmanLeh" package. All analyses were completed in R statistical software, version 4.1.1.

Unidimensionality test
Results of the initial confirmatory factor analysis (CFA) in Table 1 showed that, although the factor loadings of all 15 items were greater than the threshold of 0.3, the fit statistics indicated poor model fit (χ^2, p<0.001; TLI = 0.74; CFI = 0.78; RMSEA = 0.15; RMSR = 0.08). Therefore, we conducted exploratory factor analysis (EFA) to further investigate the dimensional structure of the ENRICh measure. Parallel analysis suggested the existence of two components; however, factor analysis revealed only one factor with an eigenvalue greater than 1. As the second component was very weak, with an eigenvalue of 1.50, and one dominant factor with an eigenvalue of 7.34 was apparent, we chose to proceed with a single-factor structure for the remainder of the analyses.
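For readers unfamiliar with parallel analysis, a minimal Python sketch is given below. It illustrates the general idea (comparing observed correlation-matrix eigenvalues with those from random data of the same shape) and is not the procedure run in R for this study; the function name, the use of simulated normal noise, and the 95th-percentile criterion are assumptions.

```python
import numpy as np

def parallel_analysis(data, n_sims=200, percentile=95, seed=0):
    """Return observed eigenvalues, random-data thresholds, and components to retain."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    # Eigenvalues of the observed correlation matrix, largest first
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    random_eigs = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.standard_normal((n, p))   # uncorrelated data of the same shape
        corr = np.corrcoef(noise, rowvar=False)
        random_eigs[s] = np.sort(np.linalg.eigvalsh(corr))[::-1]
    threshold = np.percentile(random_eigs, percentile, axis=0)
    # Retain leading components whose eigenvalue exceeds the random benchmark
    exceeds = observed > threshold
    n_retain = p if exceeds.all() else int(np.argmin(exceeds))
    return observed, threshold, n_retain
```

Applied to an item-response matrix with one row per respondent and one column per item, the sketch returns the number of leading components that stand out from noise; a large gap between the first and second observed eigenvalues is the pattern described above as one dominant factor.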

Scalability of items
Because each polytomous item had more than 10 response categories, Mokken analysis could not be applied at this stage to confirm the unidimensional structure suggested by the EFA results or to evaluate item homogeneity for the scalability assumption. The DIF results in Table 2 showed eight items with uniform DIF, two for age group and six for race group, and no DIF by gender. Slight differences in trait distribution were observed between age groups and between race groups in Fig 1. Younger adults (<65) and non-white groups were likely to experience more severe financial hardship relating to cancer treatment than their respective counterparts. The magnitude of DIF was small for all eight items, with pseudo R^2 ranging from 0.004 to 0.03; therefore, its impact was considered negligible.

IRT GRM results
All F1 factor loadings were greater than 0.60, indicating adequate loading (see Table 3). Fourteen items had a discrimination parameter (a) higher than the threshold of 1.35; the exception was item 15 (a = 1.26) [28].
Furthermore, the item characteristic curves showed disordered thresholds for all 15 items (see Fig 2). The histogram of each item confirmed the uneven distribution of responses across categories. Therefore, we addressed this issue by recoding the response categories of all items.
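For reference, the GRM quantities discussed here can be written out as follows. This is the standard logistic form of the model, stated as a reading aid; the symbols $i$ (item), $k$ (threshold index), and $K_i$ (number of thresholds for item $i$) are notation introduced for this note rather than taken from the study's tables.

```latex
% Cumulative probability of responding in category k or higher on item i
P^{*}_{ik}(\theta) = \frac{1}{1 + \exp\left[-a_i\,(\theta - b_{ik})\right]},
\qquad k = 1, \dots, K_i

% Probability of responding in exactly category k,
% with P^{*}_{i0}(\theta) = 1 and P^{*}_{i,K_i+1}(\theta) = 0
P_{ik}(\theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta),
\qquad k = 0, \dots, K_i
```

With an 11-point rating scale, each item has $K_i = 10$ thresholds $b_{i1}, \dots, b_{i,10}$. When adjacent categories attract few responses, the estimated category curves overlap so heavily that some categories are never the most probable response at any $\theta$, which is the pattern described above as disordered thresholds.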

Local independence of items test
The local independence assumption was reasonable for most items. However, the residual correlations between items 5 (ability to pay all of your bills) and 6 (ability to pay for food) (Q3 = +0.49), between items 8 (ability to contribute to your normal household responsibilities and daily chores) and 13 (having someone to help with your normal household responsibilities and daily chores) (Q3 = +0.48), and between items 14 (having someone to help care for the people who normally depend on you) and 15 (having help from community resources) (Q3 = +0.40) were higher than the recommended cutoff of +0.2. Based on the item information curves, items 6, 13, and 15 provided less information than items 5, 8, and 14, respectively; therefore, items 6, 13, and 15 were eliminated from the final round of analysis below.
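The Q3 statistic used in this test is the correlation between pairs of item residuals, where a residual is the observed item score minus the score expected under the fitted model at the respondent's estimated trait level. A minimal Python sketch follows; the function names and the convention of passing precomputed expected scores are assumptions, and the cutoff defaults to the +0.2 value quoted above.

```python
import numpy as np

def q3_matrix(observed, expected):
    """Correlations between item residuals (rows = respondents, cols = items)."""
    residuals = np.asarray(observed, dtype=float) - np.asarray(expected, dtype=float)
    return np.corrcoef(residuals, rowvar=False)

def flag_dependent_pairs(q3, cutoff=0.2):
    """List item pairs (i, j, Q3) whose residual correlation exceeds the cutoff."""
    n_items = q3.shape[0]
    return [(i, j, float(q3[i, j]))
            for i in range(n_items)
            for j in range(i + 1, n_items)
            if q3[i, j] > cutoff]
```

For each flagged pair, one reasonable follow-up, and the one described above, is to compare the two items' information curves and drop the less informative item.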

Analysis results of final round of IRT assumption test
After item removal and rescoring, the remaining 12 items were reanalyzed. Factor loadings were sufficient and are shown in parentheses in Table 1. Table 4 indicates that each item and the whole scale achieved sufficient scalability, as all Loevinger's H coefficients were greater than 0.30. The results of the reanalyzed IRT GRM are presented in parentheses in Table 3. The number of difficulty thresholds (b) per item was reduced from 10 to 2. The collapsed thresholds displayed in Fig 3 demonstrate that the disordered threshold issue was resolved. The test information curve (Fig 4) showed that test information was most concentrated around a theta (θ) of 1. The mean factor score for the whole scale was 0.0005 (sd = 0.95). The GRM achieved acceptable model fit based on the fit statistics (TLI = 0.94; CFI = 0.95; RMSEA = 0.09; RMSR = 0.06). The revised ENRICh measure showed better psychometric properties than the original (see S3 Table).
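Reducing an 11-point item to three categories amounts to merging adjacent raw scores. The Python sketch below shows one way to do this. The cut points are purely hypothetical, since the text does not state which raw scores were merged, and the function name is likewise an assumption.

```python
import numpy as np

# HYPOTHETICAL cut points, for illustration only:
# raw 0-3 -> 0, raw 4-6 -> 1, raw 7-10 -> 2
HYPOTHETICAL_CUTS = (4, 7)

def collapse_categories(scores, cuts=HYPOTHETICAL_CUTS):
    """Map 0-10 item scores onto ordered categories 0..len(cuts)."""
    scores = np.asarray(scores)
    if scores.size and (scores.min() < 0 or scores.max() > 10):
        raise ValueError("scores must lie on the 0-10 rating scale")
    # np.digitize returns, for each score, the number of cut points it reaches
    return np.digitize(scores, bins=cuts)
```

With two cut points, every item has two thresholds, matching the reduction from 10 thresholds to 2 reported above.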

Results of CAT simulation
The results of three CAT simulations with varied stopping rules (SEs at 0.32, 0.45, and 0.55) are presented in Table 5. The lowest average number of items used during simulations was 2, whereas the correlation between the thetas derived from the CAT simulation and those from the fixed 12-item measure was as high as 0.98 when the SE was set to 0.32. Items 2 (money you owe) and 4 (your stress level about finances), which carried the most information, were most frequently used during the CAT simulation (see Fig 5), followed by items 1 (money in your savings) and 5 (ability to pay bills). The factor scores obtained from items 2 and 4 alone were closely correlated with those derived from the fixed 12-item measure (r = 0.85, p<0.001). After adding items 1 and 5, the factor scores of the 4-item ENRICh were much more closely associated with those of the fixed 12-item measure (r = 0.96, p<0.001), as these four items provided 97.04% of item information over the theta range of (-2,+2) (Table 6). Agreement between the CAT simulation and the fixed 12-item ENRICh measure was evaluated using the Bland-Altman plot in Fig 6. The 95% limits of agreement ranged from -2.69 to +2.66, and fewer than 5% of observations fell outside this range.
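The 95% limits of agreement in a Bland-Altman analysis are the mean of the paired differences plus or minus 1.96 standard deviations of those differences. A minimal Python sketch follows; it is not the "BlandAltmanLeh" R implementation, and the function name and return layout are assumptions.

```python
import numpy as np

def bland_altman_limits(scores_a, scores_b):
    """Return bias, lower limit, upper limit, and share of pairs outside the limits."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    bias = float(diff.mean())            # mean difference between the two methods
    sd = float(diff.std(ddof=1))         # sample SD of the differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    outside = float(np.mean((diff < lower) | (diff > upper)))
    return bias, lower, upper, outside
```

Passing the CAT-derived scores and the fixed-form scores for the same respondents would give the bias, the two limits, and the proportion of observations falling outside them.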

Comparison among full ENRICh, CAT, and ENRICh-4
The basic information and comparison of participant scores among these three versions of ENRICh are displayed in Tables 7 and 8. The 4-item ENRICh is referred to as ENRICh-4. The mean participant score was highest for ENRICh-4 (0.003, sd = 0.93), and the root mean square deviation between participant scores on ENRICh-4 and the full ENRICh was the largest (RMSD = 0.31). Fig 7 indicates that patients with higher-than-average levels of toxicity were most accurately measured using either the full ENRICh or ENRICh-4.
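The RMSD reported here summarizes how far, on average, the scores from a shortened version fall from the full-length scores for the same respondents. A minimal Python sketch of the calculation follows (the function name is an assumption).

```python
import numpy as np

def rmsd(scores_short, scores_full):
    """Root mean square deviation between two sets of paired scores."""
    diff = np.asarray(scores_short, dtype=float) - np.asarray(scores_full, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Unlike a correlation, the RMSD is expressed in the units of the score itself, so it also reflects any systematic shift between versions.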

Main findings
Through IRT-based advanced psychometric analysis, we developed a shorter version of the ENRICh measure as well as an efficient CAT version. Both versions yield scores comparable to those of the full-length ENRICh. Applied in practice, these options are intended to reduce respondent burden while still providing an efficient means of identifying high-risk patients needing intervention for financial toxicity. The simulated CAT version provides a novel option to improve the efficiency and accuracy of PROM assessment [17,29]. The 4-item short version provides an option to minimize administrative burden in settings where specific items assessing the broad range of dimensions of financial toxicity are not required. Moreover, the high reliability of the ENRICh measure (α = 0.92) in this study of American participants supports its use in American clinical settings.
In this study, we explored a unidimensional factor structure for the ENRICh PROM, which has also been evaluated as a multidimensional measure [14]. While factor analytic methods provided only moderate support for unidimensionality, Mokken analysis demonstrated appropriate scaling along a single dimension. In the local independence test, items 6 (ability to pay for food), 13 (having someone to help with your normal household responsibilities and daily chores), and 15 (having help from community resources) showed strong residual correlations with other items. Many plausible causes not closely associated with cancer treatment could contribute to the distress captured by these items. Because they also provided less information to the scale than the items they were correlated with, all three were eliminated from further analyses. This study also identified eight uniform DIF items for age and race groups, which indicates that certain items are interpreted differently by different demographic groups [29]. Their non-crossing plots showed that the demographic differences in these items are consistent across severity levels of the financial toxicity continuum. Although the effect of the DIF items was not meaningful by the cutoff we adopted for this study, we note that 4 of the 8 DIF items had a beta change (β) greater than the 5% cutoff recommended by other researchers [30]. During the CAT simulation, items 1 (money in your savings), 2 (money you owe), 4 (stress level about finances), and 5 (ability to pay bills) were the most frequently used. These items may be indicative of depleted coping resources and entry into a phase of increased financial toxicity, which is consistent with prior studies [24,31]. The strong correlation between the factor scores of items 1, 2, 4, and 5 and those of the fixed 12-item scale makes the use of the ENRICh-4 version possible.
This ultra-short ENRICh version may provide a quick and convenient unidimensional assessment of cancer patients' financial burden for health care providers who, in some circumstances, prioritize brevity over assessment reliability. In addition, the ENRICh-4 version is well suited for screening purposes. For researchers and investigators who are interested in further understanding the dimensionality of financial toxicity, the full ENRICh is recommended. Additionally, the available computerized adaptive measurement delivery platform, Concerto, has demonstrated the ability to move this transformative technology toward real clinical practice and research [32]. We foresee that CAT implementation will promote truly patient-centered care.

Limitations
This study has several limitations. First, although this was a multi-institutional study, future research is needed to generalize these results to patients receiving care beyond the Houston metro area. Second, the precision of trait estimation in the CAT simulation is somewhat limited by the relatively small number of ENRICh items [29]. Third, the measure's performance in different countries and languages, as well as cross-cultural differences, warrants further investigation. Fourth, the observed negative residual correlations among some items indicate possible multidimensionality, suggesting a need for multidimensional CAT simulation, which could address the fairly weak unidimensionality of ENRICh by incorporating additional item information [33,34].

Conclusion
This study shows, through advanced psychometric analysis, that new short-form and adaptive versions of the ENRICh financial toxicity measure have acceptable psychometric properties, reduced redundancy, and simplified item response options. Without sacrificing precision, the CAT version of ENRICh outperformed its fixed-length version in terms of the number of items administered. The CAT version and the ultra-short four-item version are efficient screens for the severity of financial toxicity experienced by cancer patients and can support timely guidance and intervention for targeted populations.
Supporting information S1